fix(api): close private dataset doc auth bypass by Parvezkhan0 · Pull Request #14749 · infiniflow/ragflow

Parvezkhan0 · 2026-05-10T13:31:31Z

What problem does this PR solve?

This PR fixes an authorization bypass in the token-authenticated document SDK APIs for private datasets (permission = me).
The affected routes validated dataset access with KnowledgebaseService.accessible(), but when a login token was used through the SDK token flow, the check received the caller's tenant/workspace id instead of the authenticated user id. That let another member of the same tenant pass access checks for a private dataset if they knew the dataset_id, exposing document listings and allowing document operations that should have remained owner-only.

This change forwards the authenticated user id through the token-auth flow and uses it for dataset authorization in the affected SDK document endpoints, so private datasets remain accessible only to their owner while team datasets continue to work as before.

Type of change

Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai · 2026-05-10T13:31:48Z

📝 Walkthrough

Walkthrough

Routes in api/apps/sdk/doc.py now accept optional authenticated_user_id and use a new _dataset_access_actor_id helper to derive actor_id (prefer authenticated_user_id). Route authorization calls to KnowledgebaseService.accessible(...) use that actor_id. token_required inspects handler signatures and injects authenticated_user_id when resolving login-token users. Tests updated/added to verify behavior.

Changes

User-Based Authorization for SDK Routes

Layer / File(s)	Summary
Authorization identity and route signatures `api/apps/sdk/doc.py`	Adds `_dataset_access_actor_id(tenant_id, authenticated_user_id)` and updates `download`, `parse`, `stop_parsing`, `retrieval_test` signatures to accept `authenticated_user_id: str \| None = None`.
Route authorization calls `api/apps/sdk/doc.py`	`download`, `parse`, `stop_parsing`, and `retrieval_test` compute actor via helper and pass it as `user_id` to `KnowledgebaseService.accessible(...)` instead of using `tenant_id`.
Authentication decorator `api/utils/api_utils.py`	`token_required` inspects the wrapped handler signature for `authenticated_user_id` and conditionally injects it into kwargs when resolving a login-token JWT user.
Tests & Verification `test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py`, `test/unit_test/api/utils/test_api_utils_token_required.py`	Updated `test_download_and_download_doc_errors` to use KnowledgebaseService.accessible; added `test_sdk_routes_use_authenticated_user_for_dataset_access` to assert routes pass authenticated_user_id; added unit test validating `token_required` injects authenticated_user_id for login tokens.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

#14659: Bug report describing TDocument APIs leaking private datasets to tenant members — this PR supplies per-user actor wiring so accessible(...) can enforce owner-only access.

Possibly related PRs

infiniflow/ragflow#14645: Changes KnowledgebaseService.accessible() and DocumentService.accessible() to enforce per-dataset permission rules; complements this PR's actor wiring.
infiniflow/ragflow#14782: Related adjustments in api/apps/sdk/doc.py around retrieval_test authorization checks.
infiniflow/ragflow#14910: Modifies download-route authorization wiring related to tenant/user identity.

Suggested reviewers

wangq8
Lynn-Inf

Poem

🐰 I hop through code with nimble feet,
Passing user ids where checks must meet.
Tenant or user, the helper decides,
Now private datasets stay safe inside.
Hooray — secure access, carrot-feast!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 21.05% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: fixing an authorization bypass in private dataset document access for SDK APIs.
Description check	✅ Passed	The PR description follows the template with both required sections completed: a clear problem statement with issue reference and type of change (Bug Fix) is provided.
Linked Issues check	✅ Passed	The PR directly addresses issue `#14659` by forwarding authenticated user IDs through token-auth flows and using them for dataset authorization checks, enforcing owner-only access for private datasets.
Out of Scope Changes check	✅ Passed	All changes are scoped to fixing the authorization bypass: helper function for actor ID derivation, decorator enhancements to pass authenticated user ID, route signature updates, and comprehensive tests.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

api/apps/sdk/doc.py (1)

53-55: ⚡ Quick win

Add a docstring to explain the access control logic.

The function handles important authorization semantics (JWT-authenticated user vs API-key tenant fallback) without documentation. The suggested docstring clarifies when each path is used:

📝 Suggested docstring

 def _dataset_access_actor_id(tenant_id: str, authenticated_user_id: str | None = None) -> str:
+    """
+    Determine the actor ID for dataset access authorization.
+    
+    Returns authenticated_user_id when available (JWT/login-token flow),
+    otherwise falls back to tenant_id (API-key flow or unauthenticated).
+    """
     return authenticated_user_id or tenant_id

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 53 - 55, Add a clear docstring to
_dataset_access_actor_id describing the access-control semantics: explain that
when a JWT-authenticated user ID is present the function returns that user (used
for per-user authorization/audit), and when authenticated_user_id is None it
falls back to the tenant_id (used for API-key based requests or tenant-scoped
actions); mention expected parameter types and the function's return value and
include a short example of both paths for clarity.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/apps/sdk/doc.py`:
- Around line 208-211: The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) lacks
logging; add logging similar to the download flow: log an info/debug message
before the check indicating the dataset_id and actor_id being validated and log
a warning/error when access is denied (including dataset_id and actor_id) so
operators can trace authorization failures; update the same function that
performs this check to call the logger and follow the existing logging format
used by the download function.
- Around line 302-305: The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) is missing
logging; add the same logging pattern used in the download/parse flows to record
the check input (dataset_id, tenant_id, authenticated_user_id) and the
authorization outcome. Specifically, before/around the if, emit a log entry
(matching the existing logger level/format used in download/parse) that includes
dataset_id, tenant_id, actor id from _dataset_access_actor_id(...), and whether
access was granted/denied so the decision is auditable and consistent with other
flows.
- Around line 432-435: The dataset access check loop lacks logging; add
structured logs using the existing actor id and kb_ids: log a single INFO-level
message before the loop mentioning kb_ids and actor_id (from
_dataset_access_actor_id), and inside the loop log a WARN/ERROR when
KnowledgebaseService.accessible(kb_id=id, user_id=actor_id) returns False
including the denied kb id and actor_id (and authenticated_user_id if available)
before returning get_error_data_result so denied access events are recorded for
debugging and audit.
- Around line 97-100: Add structured logging around the
KnowledgebaseService.accessible check: log the attempted access with dataset_id,
the actor returned by _dataset_access_actor_id(tenant_id,
authenticated_user_id), and the authorization result (allowed/denied). Use the
module logger (e.g., logger or security logger) and emit a warning or info-level
entry when access is denied (include tenant_id and authenticated_user_id as
context), so the authorization decision for KnowledgebaseService.accessible is
recorded for audit and debugging.

In `@api/utils/api_utils.py`:
- Around line 339-340: The code path that injects authenticated_user_id into
kwargs (the block checking accepts_authenticated_user_id and setting
kwargs["authenticated_user_id"] = user[0].id) needs an audit log entry; add a
concise log statement (using the module logger or existing logger variable)
immediately before or after the assignment that records the action and key
context such as the injected user id and the target dataset/request identifier
(if available) and use an appropriate level (info/debug) while avoiding
sensitive data; update any import or logger initialization (e.g., logger =
logging.getLogger(__name__)) if not already present.

---

Nitpick comments:
In `@api/apps/sdk/doc.py`:
- Around line 53-55: Add a clear docstring to _dataset_access_actor_id
describing the access-control semantics: explain that when a JWT-authenticated
user ID is present the function returns that user (used for per-user
authorization/audit), and when authenticated_user_id is None it falls back to
the tenant_id (used for API-key based requests or tenant-scoped actions);
mention expected parameter types and the function's return value and include a
short example of both paths for clarity.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 272a4cf4-6bed-4bf2-a7f7-55ab6fde80c1

📥 Commits

Reviewing files that changed from the base of the PR and between 6bfe0f9 and 60e606a.

📒 Files selected for processing (4)

api/apps/sdk/doc.py
api/utils/api_utils.py
test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py
test/unit_test/api/utils/test_api_utils_token_required.py

coderabbitai · 2026-05-10T13:36:16Z

+    if not KnowledgebaseService.accessible(
+        kb_id=dataset_id,
+        user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
+    ):


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

This security-critical authorization check determines access to private datasets but lacks logging. Adding a log statement would improve security audit trails and debugging.

🔒 Proposed logging addition

+ actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id) + logging.debug("Checking dataset access: dataset_id=%s actor_id=%s (authenticated_user_id=%s)", dataset_id, actor_id, authenticated_user_id) if not KnowledgebaseService.accessible( kb_id=dataset_id, - user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id), + user_id=actor_id, ): + logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", dataset_id, actor_id) return get_error_data_result(message=f"You do not own the dataset {dataset_id}.")

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/apps/sdk/doc.py` around lines 97 - 100, Add structured logging around the KnowledgebaseService.accessible check: log the attempted access with dataset_id, the actor returned by _dataset_access_actor_id(tenant_id, authenticated_user_id), and the authorization result (allowed/denied). Use the module logger (e.g., logger or security logger) and emit a warning or info-level entry when access is denied (include tenant_id and authenticated_user_id as context), so the authorization decision for KnowledgebaseService.accessible is recorded for audit and debugging.

coderabbitai · 2026-05-10T13:36:16Z

+    if not KnowledgebaseService.accessible(
+        kb_id=dataset_id,
+        user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
+    ):


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

Similar to the download function, this authorization check lacks logging. Consider adding the same logging pattern here for consistency.

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/apps/sdk/doc.py` around lines 208 - 211, The authorization check using KnowledgebaseService.accessible(kb_id=dataset_id, user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) lacks logging; add logging similar to the download flow: log an info/debug message before the check indicating the dataset_id and actor_id being validated and log a warning/error when access is denied (including dataset_id and actor_id) so operators can trace authorization failures; update the same function that performs this check to call the logger and follow the existing logging format used by the download function.

coderabbitai · 2026-05-10T13:36:16Z

+    if not KnowledgebaseService.accessible(
+        kb_id=dataset_id,
+        user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
+    ):


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

Similar to the download and parse functions, this authorization check lacks logging. Consider adding the same logging pattern for consistency.

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/apps/sdk/doc.py` around lines 302 - 305, The authorization check using KnowledgebaseService.accessible(kb_id=dataset_id, user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) is missing logging; add the same logging pattern used in the download/parse flows to record the check input (dataset_id, tenant_id, authenticated_user_id) and the authorization outcome. Specifically, before/around the if, emit a log entry (matching the existing logger level/format used in download/parse) that includes dataset_id, tenant_id, actor id from _dataset_access_actor_id(...), and whether access was granted/denied so the decision is auditable and consistent with other flows.

coderabbitai · 2026-05-10T13:36:16Z

+    actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
    for id in kb_ids:
-        if not KnowledgebaseService.accessible(kb_id=id, user_id=tenant_id):
+        if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
            return get_error_data_result(f"You don't own the dataset {id}.")


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

This function checks access to potentially multiple datasets but lacks logging. Since this loops over dataset IDs, consider logging once before the loop (with the full list) and/or within the loop for denied access.

🔍 Proposed logging addition

actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id) + logging.debug("Checking access to datasets: dataset_ids=%s actor_id=%s (authenticated_user_id=%s)", kb_ids, actor_id, authenticated_user_id) for id in kb_ids: if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id): + logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", id, actor_id) return get_error_data_result(f"You don't own the dataset {id}.")

As per coding guidelines, "**/*.py: Add logging for new flows".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)

for id in kb_ids:

if not KnowledgebaseService.accessible(kb_id=id, user_id=tenant_id):

if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):

return get_error_data_result(f"You don't own the dataset {id}.")

actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)

logging.debug("Checking access to datasets: dataset_ids=%s actor_id=%s (authenticated_user_id=%s)", kb_ids, actor_id, authenticated_user_id)

for id in kb_ids:

if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):

logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", id, actor_id)

return get_error_data_result(f"You don't own the dataset {id}.")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/apps/sdk/doc.py` around lines 432 - 435, The dataset access check loop lacks logging; add structured logs using the existing actor id and kb_ids: log a single INFO-level message before the loop mentioning kb_ids and actor_id (from _dataset_access_actor_id), and inside the loop log a WARN/ERROR when KnowledgebaseService.accessible(kb_id=id, user_id=actor_id) returns False including the denied kb id and actor_id (and authenticated_user_id if available) before returning get_error_data_result so denied access events are recorded for debugging and audit.

coderabbitai · 2026-05-10T13:36:16Z

+                    if accepts_authenticated_user_id:
+                        kwargs["authenticated_user_id"] = user[0].id


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for the authenticated user injection flow.

This security-relevant code path injects the authenticated user ID for dataset authorization checks but lacks logging. Adding a log statement would improve observability and audit trails for private dataset access.

📊 Proposed logging addition

kwargs["tenant_id"] = tenants[0].tenant_id if accepts_authenticated_user_id: + logging.debug("JWT authentication: injecting authenticated_user_id=%s for tenant_id=%s", user[0].id, tenants[0].tenant_id) kwargs["authenticated_user_id"] = user[0].id

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/utils/api_utils.py` around lines 339 - 340, The code path that injects authenticated_user_id into kwargs (the block checking accepts_authenticated_user_id and setting kwargs["authenticated_user_id"] = user[0].id) needs an audit log entry; add a concise log statement (using the module logger or existing logger variable) immediately before or after the assignment that records the action and key context such as the injected user id and the target dataset/request identifier (if available) and use an appropriate level (info/debug) while avoiding sensitive data; update any import or logger initialization (e.g., logger = logging.getLogger(__name__)) if not already present.

Copilot

Pull request overview

This PR fixes an authorization bypass in SDK document endpoints when callers authenticate using a login token: the SDK auth layer now forwards the authenticated user id (in addition to tenant/workspace id), and the SDK document routes use that user id for dataset authorization so permission = me datasets remain owner-only.

Changes:

Update token_required to optionally inject authenticated_user_id for login-token authentication when the wrapped handler accepts it.
Update SDK document routes to authorize dataset access using the authenticated user id (falling back to tenant id when not available).
Add/adjust unit tests to verify correct user-id propagation and that SDK document routes pass the correct actor id into KnowledgebaseService.accessible().

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
`api/utils/api_utils.py`	Enhances `token_required` to conditionally pass `authenticated_user_id` for login-token flows.
`api/apps/sdk/doc.py`	Uses authenticated user id for dataset access checks in SDK document endpoints (download/parse/stop/retrieval).
`test/unit_test/api/utils/test_api_utils_token_required.py`	New unit test validating `token_required` injects authenticated user id for login tokens.
`test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py`	Updates existing unit tests and adds coverage ensuring SDK routes call dataset access checks with `user_id=authenticated_user_id`.

coderabbitai

♻️ Duplicate comments (1)

api/apps/sdk/doc.py (1)

98-102: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add audit logging around the new authorization decision points.

The new actor-based access checks return early on denial but still don’t log the decision path. Please add a consistent pre-check and denied-check log for these paths.

🔧 Suggested pattern

+    actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
+    logging.debug(
+        "sdk.dataset_access.check dataset_id=%s actor_id=%s tenant_id=%s authenticated_user_id=%s",
+        dataset_id, actor_id, tenant_id, authenticated_user_id
+    )
-    if not KnowledgebaseService.accessible(
-        kb_id=dataset_id,
-        user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
-    ):
+    if not KnowledgebaseService.accessible(kb_id=dataset_id, user_id=actor_id):
+        logging.warning(
+            "sdk.dataset_access.denied dataset_id=%s actor_id=%s tenant_id=%s",
+            dataset_id, actor_id, tenant_id
+        )
         return get_error_data_result(message=f"You do not own the dataset {dataset_id}.")

As per coding guidelines, "**/*.py: Add logging for new flows".

Also applies to: 220-223, 314-317, 444-447

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 98 - 102, Add audit logging around the
KnowledgebaseService.accessible authorization checks: before calling
KnowledgebaseService.accessible (in the block using _dataset_access_actor_id and
any similar checks at the other noted locations) log a pre-check audit entry
with tenant_id, authenticated_user_id, dataset_id and the actor id being used;
if accessible returns False, log a denied audit entry with the same context and
the decision reason, then return get_error_data_result as before. Use the same
logger used elsewhere in this module and include function names or request ids
where available to correlate logs (reference: KnowledgebaseService.accessible,
_dataset_access_actor_id, get_error_data_result).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@api/apps/sdk/doc.py`:
- Around line 98-102: Add audit logging around the
KnowledgebaseService.accessible authorization checks: before calling
KnowledgebaseService.accessible (in the block using _dataset_access_actor_id and
any similar checks at the other noted locations) log a pre-check audit entry
with tenant_id, authenticated_user_id, dataset_id and the actor id being used;
if accessible returns False, log a denied audit entry with the same context and
the decision reason, then return get_error_data_result as before. Use the same
logger used elsewhere in this module and include function names or request ids
where available to correlate logs (reference: KnowledgebaseService.accessible,
_dataset_access_actor_id, get_error_data_result).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1ea4335b-e40e-4c1a-975b-01a51b72f698

📥 Commits

Reviewing files that changed from the base of the PR and between 60e606a and 56d716d.

📒 Files selected for processing (2)

api/apps/sdk/doc.py
api/utils/api_utils.py

🚧 Files skipped from review as they are similar to previous changes (1)

api/utils/api_utils.py

codecov · 2026-05-19T03:20:55Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.16%. Comparing base (db9e782) to head (56d716d).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #14749   +/-   ##
=======================================
  Coverage   94.16%   94.16%           
=======================================
  Files          10       10           
  Lines         703      703           
  Branches      112      112           
=======================================
  Hits          662      662           
  Misses         25       25           
  Partials       16       16

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

KevinHuSh · 2026-05-19T03:34:49Z

Appreciations!

Does this mean that all the functions using @token_required needs to delcare one more parameter authenticated_user_id?

`POST /api/v1/dify/retrieval` resolved the caller via @apikey_required (injecting `tenant_id`) but then fetched the requested `knowledge_id` with no tenant filter and ran the full retrieval pipeline against `kb.tenant_id` (the owner). Any valid Dify-compatible API key could retrieve chunks from any tenant whose KB UUID was known. Mirror the pattern PR infiniflow#14749 adds to the sibling sdk/doc.py routes: after the existing `KnowledgebaseService.get_by_id` lookup, call `KnowledgebaseService.accessible(kb_id, tenant_id)` and return `RetCode.AUTHENTICATION_ERROR` ("No authorization.") when it returns False. No behavior change for owners or for team members already allowed by the existing accessible() rules. Closes infiniflow#15027

POST /api/v1/dify/retrieval resolved the caller via @apikey_required (injecting tenant_id) but then fetched the requested knowledge_id with no tenant filter and ran the full retrieval pipeline against kb.tenant_id (the owner). Any valid Dify-compatible API key could retrieve chunks from any tenant whose KB UUID was known. Adds the missing ownership check. ## Root Cause api/apps/sdk/dify_retrieval.py line 253: KnowledgebaseService.get_by_id(kb_id) fetched the KB by id alone, then the handler used kb.tenant_id (the OWNER) to build the embedding model and call the retriever. The caller tenant_id was only used downstream at line 278 for retrieval_by_children, well after cross-tenant data was already retrieved. grep confirmed there was no KnowledgebaseService.accessible call anywhere in the handler. ## Fix Two-line guard immediately after the existing get_by_id lookup, mirroring the pattern PR #14749 lands for the sibling sdk/doc.py routes (download, parse, stop_parsing, retrieval_test): e, kb = KnowledgebaseService.get_by_id(kb_id) if not e: return build_error_result(message="Knowledgebase not found!", code=RetCode.NOT_FOUND) + if not KnowledgebaseService.accessible(kb_id, tenant_id): + return build_error_result(message="No authorization.", code=RetCode.AUTHENTICATION_ERROR) if kb.tenant_embd_id: ... KnowledgebaseService.accessible already handles solo-tenant ownership, team membership via TenantService.get_joined_tenants_by_user_id, and the permission=ME distinction. No behavior change for legitimate callers; cross-tenant callers now receive RetCode.AUTHENTICATION_ERROR (109). ## Test Plan - [x] Regression test added: test/unit_test/api/apps/sdk/test_dify_retrieval.py - test_cross_tenant_request_is_rejected -- attacker tenant calling owner tenant KB gets 109; retriever is not invoked - test_same_tenant_request_succeeds -- owner tenant gets the records back - test_missing_knowledge_base_returns_not_found -- missing KB returns 404 BEFORE the access check fires (legit callers see the clearer message) - [x] All 3 tests pass after the fix - [x] Cross-tenant test FAILS on pre-fix main (KeyError on result[code] because handler leaks records dict instead of returning auth error) - [x] ruff check clean on both changed files - [x] No drive-by reformatting in dify_retrieval.py -- only the 2 added lines ### Post-fix output test_cross_tenant_request_is_rejected PASSED [ 33%] test_same_tenant_request_succeeds PASSED [ 66%] test_missing_knowledge_base_returns_not_found PASSED [100%] ============================== 3 passed in 0.04s =============================== Closes #15027

fix(api): close private dataset doc auth bypass

60e606a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 10, 2026 13:31

dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label May 10, 2026

dosubot Bot added 🐖api The modified files are located under directory 'api/apps/sdk' 🐞 bug Something isn't working, pull request that fix bug. 🧪 test Pull requests that update test cases. labels May 10, 2026

Copilot started reviewing on behalf of Parvezkhan0 May 10, 2026 13:32 View session

coderabbitai Bot reviewed May 10, 2026

View reviewed changes

Copilot AI reviewed May 10, 2026

View reviewed changes

KevinHuSh added the ci Continue Integration label May 19, 2026

Merge branch 'main' into fix/private-dataset-doc-auth-14659

56d716d

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

dripsmvcp mentioned this pull request May 20, 2026

fix(api): check kb ownership in /dify/retrieval #15028

Merged

5 tasks

JinHai-CN force-pushed the main branch from 721039c to 7783487 Compare May 20, 2026 05:26

		if accepts_authenticated_user_id:
		kwargs["authenticated_user_id"] = user[0].id

Conversation

Parvezkhan0 commented May 10, 2026 • edited by JinHai-CN Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

Type of change

Uh oh!

coderabbitai Bot commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 19, 2026

Codecov Report

Uh oh!

KevinHuSh commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Parvezkhan0 commented May 10, 2026 •

edited by JinHai-CN

Loading

coderabbitai Bot commented May 10, 2026 •

edited

Loading

KevinHuSh commented May 19, 2026 •

edited

Loading